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Abstract — Optical systems which measure independent random 
projections of a scene according to compressed sensing (CS) 
theory face a myriad of practical challenges related to the size 
of the physical platform, photon efficiency, the need for high 
temporal resolution, and fast reconstruction in video settings. 
This paper describes a coded aperture and keyed exposure 
approach to compressive measurement in optical systems. The 
proposed projections satisfy the Restricted Isometry Property 
for sufficiently sparse scenes, and hence are compatible with 
theoretical guarantees on the video reconstruction quality. These 
concepts can be implemented in both space and time via 
either amplitude modulation or phase shifting, and this paper 
describes the relative merits of the two approaches in terms of 
theoretical performance, noise and hardware considerations, and 
experimental results. Fast numerical algorithms which account 
for the nonnegativity of the projections and temporal correlations 
in a video sequence are developed and applied to microscopy and 
short-wave infrared data. 

Index Terms — Compressive sensing, coded apertures, convex 
optimization, sparsity, coded exposure, Toeplitz matrices 

I. Introduction 

THE theory of compressed sensing (CS) suggests that we 
can collect high-resolution imagery with relatively few 
photodetectors or a small focal plane array (FPA) when the 
scene is sparse or compressible in some dictionary or basis and 
the measurements are chosen appropriately 0, 0. There has 
been significant recent interest in building imaging systems 
in a variety of contexts to exploit CS (cf. 0, 0, 0, 0, 
0, 0, 0, HO), H2, D2). By designing optical sensors to 
collect measurements of a scene according to CS theory, we 
can use sophisticated computational methods to infer critical 
scene structure and content. One particularly famous example 
in optical imaging is the Single Pixel Camera lfl3l . which 
collects individual projections of the scene sequentially. While 
these measurements are supported by the CS literature, there 
are several practical challenges associated with the tradeoff 
between temporal resolution physical system footprint. In this 
paper we describe an alternative approach to designing a low 
frame-rate snapshot CS camera which naturally parallelizes 
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the compressive data acquisition. Our approach is based on 
two imaging techniques called coded apertures and keyed 
exposures, which we explain next. 

Coded apertures lfl4l . ITT31 have a long history in low- 
light astronomical applications. Coded apertures were first 
developed to increase the amount of light hitting a detector 
in an optical system without sacrificing resolution (by, say, 
increasing the diameter of an opening in a pinhole camera). 
The basic idea is to use a mask, i.e., an opaque rectangu- 
lar plate with a specified pattern of openings, that allows 
significantly brighter observations with higher signal-to-noise 
ratio than those from conventional pinhole cameras 1 14|, 1 15 1. 
These masks encode the image before detection, inducing a 
more complicated point spread function than that associated 
with a pinhole aperture. The original scene is then recovered 
from the encoded observations in post-processing using an 
appropriate reconstruction algorithm which exploits the mask 
pattern. These multiplexing techniques are particularly popular 
in astronomical fl6l . ifTTI and medical ifTSl . Ifl9l . BOl appli- 
cations because of their efficacy at wavelengths where lenses 
cannot be used, but recent work has also demonstrated their 
utility for collecting both high resolution images and object 
depth information simultaneously ED . 

Keyed exposure (also called coded exposure l22l . flutter 
shutter 1231 , or coded strobing |24|) imaging was initially 
developed to facilitate motion deblurring in video using a 
relatively low frame rate. In some cases motion has been 
inferred from a single keyed exposure snapshot. The basic idea 
is that the camera sensor continuously collects light while the 
shutter is rapidly opened and closed; the shutter movement 
effectively modulates the motion blur point spread function, 
and with well-chosen shutter movement patterns it becomes 
possible to deblur moving objects. Similar effects can be 
achieved using a strobe light instead of moving a shutter. 

Despite the utility of the above methods in specific settings, 
they both face some limitations. The design of conventional 
coded apertures does not account for the inherent structure 
and compressibility of natural scenes, nor the potential for 
nonlinear reconstruction algorithms. Likewise, existing keyed 
exposure methods focus on direct (uncoded) measurements of 
the spatial content of the scene and have limited reconstruction 
capabilities, as we detail below. The Coded Aperture Keyed 
Exposure (CAKE) sensing paradigm we propose in this paper 
is designed to allow nonlinear high-resolution video recon- 
struction from relatively few measurements in more general 
settings. 
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A. Problem Formulation 

We consider the problem of reconstructing an iV-frame 
video sequence /*, where each frame is an n\ x n-i two- 
dimensional image denoted /*. Using standard vector repre- 
sentation, we have that f£ € R™ for t = 1, ...,N where 
n = niU2 is the total number of pixels. As a result, the vector 
representation of the video sequence is /* = (/*, . . . , /jy) £ 
W lN . 

The observations y of /* are also acquired as a video 
sequence. We do not assume that the observations are acquired 
at the same rate at which we will ultimately reconstruct /*. 
In general, we assume y is an M-frame video sequence, with 
each frame y% of size mi x ni2- Similarly to /*, we have 
y k £ W n for k = 1, . . . , M, where m = mim 2 , therefore 

y = (yi,...,y M ) eR mM . 

We observe /* via a spatio-temporal sensing matrix A £ 
^rnMxnN which linearly projects the spatio-temporal scene 
onto an mM-dimensional set of observations: 

y = Af* + w, (1) 

where w £ R mM is noise associated with the physics of the 
sensor. 

CS optical imaging systems must be designed to meet 
several competing objectives: 

• The sensing matrix A must satisfy some necessary criterion 
(such as the RIP, defined below) which provides theoretical 
guarantees on the accuracy with which we can estimate /* 
from y. 

• The total number of measurements, mM, must be lower 
than the total number of pixels to be reconstructed, nN . 
This is achievable via compressive spatial acquisition 
(m < n), frame rate reduction (M < N), or simultaneous 
spatio-temporal compression. 

• The measurements y must be causally related to the 
temporal scene /*, which restricts the structure of the 
projections A. 

• The optical measurements modeled by A must be imple- 
mentable in a way that results in a smaller, cheaper, more 
robust, or lower power system. 

• The sensing matrix structure must facilitate fast reconstruc- 
tion algorithms. 

This paper demonstrates that compressive Coded Aperture 
Keyed Exposure systems achieve all these objectives. 

B. Contributions 

The primary contribution of this paper is the design and 
theoretical characterization of compressive Coded Aperture 
Keyed Exposure (CAKE) sensing. We explore amplitude 
modulating and phase shifting masks and describe theoretical 
and implementation aspects of both. We further describe how 
keyed exposure ideas can be used in conjunction with coded 
apertures to increase both the spatial and temporal resolution 
of video from relatively few measurements. We prove hitherto 
unknown theoretical properties of such systems and demon- 
strate their efficacy in several simulations. In addition, we 
discuss several important algorithmic aspects of our approach, 



including a mean-subtraction pre-processing step which allows 
us to sidestep challenging theoretical aspects associated with 
nonnegative sensing matrices (such as amplitude modulating 
coded apertures). This paper builds substantially upon earlier 
preliminary studies by the authors ll25ll . Il26ll . E7\ and related 
independent work by Romberg [28 1. 

C. Organization of the Paper 



The paper is organized as follows. Section II-A describes 
conventional coded aperture imaging techniques and Sec- 
tion II-B describes keyed exposure techniques currently in the 



literature. We describe the compressive sensing problem and 
formulate it mathematically in Section |II-C| In Section [ill] 
we show how CS theory can be used for constructing coded 
aperture masks that can easily be implemented for improving 
image reconstruction resolution in snapshot imaging; this 
includes theoretical results, a discussion of implementation 
details and tradeoffs in practical optical systems, and experi- 
mental results. We consider applications to video compressed 



sensing using the full CAKE paradigm in Section IV includ- 
ing theory and experimental results. 

II. Background 

Prior to detailing our main contributions, we first review 
pertinent background material. This review touches upon the 
development of coded aperture imaging, coded exposure pho- 
tography, and a brief review of salient concepts in compressed 
sensing theory. 

A. Coded Aperture Imaging 

Seminal work in coded aperture imaging includes the de- 
velopment of masks based on Hadamard transform optics l29ll 
and pseudorandom phase masks 11301 . Modified Uniformly 
Redundant Arrays (MURAs) 0T1 are generally accepted as 
optimal mask patterns for coded aperture imaging. These 
mask patterns (which we denote by /j MURA ) are binary, 
square patterns, whose grid size matches the spatial resolu- 
tion of the photo-detector. Each mask pattern is specifically 
designed to have a complementary pattern /j MURA such that 
/i MURA * /jMURA is a sm gi e wjtjj f[ at side-lobes (i.e., a 
Kronecker 6 function). 

In practice, the resolution of a detector array dictates the 
properties of the mask pattern and hence resolution at which 
/* can be reconstructed. We model this effect as /* being 
downsampled to the resolution of the detector array and then 
convolved with the mask pattern /i MURA , which has the same 
resolution as the FPA and the downsampled /*, i.e., 



V = (An,/*) * h 



MURA 



W, 



(2) 



where * denotes circular convolution, w corresponds to noise 
associated with the physics of the sensor, and Ant/* is the 
integration downsampling of the scene, which consists of 
partitioning /* into uniformly sized di x di blocks, where 
di = ni/rrii for i = 1, 2, and measuring the total intensity in 
each block. 
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Because of the construction of h MVRA and h MVRA , Ant/* 
can be reconstructed using 



/ = y*h 



MURA 



However, the resulting resolution is often lower than what is 
necessary to capture some of the desired details in the image. 
Clearly, the estimates from MURA reconstruction are limited 
by the spatial resolution of the photo-detector. Thus, high 
resolution reconstructions cannot generally be obtained from 
low-resolution MURA-coded observations. 

It can be shown that this mask design and reconstruction 
result in minimal reconstruction errors at the FPA resolution 
and subject to the constraint that linear, convolution-based 
reconstruction methods would be used. However, when the 
scene of interest is sparse or compressible, and nonlinear 
sparse reconstruction methods may be employed, then CS 
ideas can be used to design coded aperture which yield higher 
resolution images. Before describing the details of this, we 
briefly review two key relevant concepts from CS. 

B. Coded (Keyed) Exposure Imaging 

Coded (or keyed) exposures were developed recently in the 
computational photography community. Initial work in this 
area was focused on engineering the temporal component of 
a motion blur point spread function by rapidly opening and 
closing the shutter during a single exposure or a small number 
of exposures at a low frame rate [22], [23 1 . That is, 



y = A™f* 



= £/* 



w, 



ies 



where the keyed exposure (KE) measurement matrix A KE 
selects the subset of frames during which the shutter is open. 
We refer to this subset as the exposure code S C {1, . . . , N}. 

If an object is moving during image acquisition, then a static 
shutter would induce a typical motion blur, making the moving 
object difficult to resolve with standard deblurring methods. 
However, by "fluttering" the shutter during the exposure 
using carefully designed patterns, the induced motion blur 
can be made invertible and moving objects can be accurately 
reconstructed. Instead of a moving shutter, more recent work 
uses a strobe light to produce a similar effect |24l . 

While this novel approach to video acquisition can produce 
very accurate deblurred images of moving objects, there is sig- 
nificant overhead associated with the reconstruction process. 
To see why, note that every object moving with a different 
velocity or trajectory will produce a different motion blur. This 
means that (a) any stationary background must be removed 
during preprocessing and (b) multiple moving objects must be 
separated and processed individually. 

More recently, it was shown that these challenges could 
be sidestepped when the video is temporally periodic (e.g., 
consider a video of an electronic toothbrush spinning) |24|. 
The periodic assumption amounts to a sparse temporal Fourier 
transform of the video, and this approach, called coded strob- 
ing, is a compressive acquisition in the temporal domain. As a 
result, the authors were able to leverage ideas from compressed 
sensing to achieve high-quality video reconstruction. 



The assumption of a periodic video makes it possible to 
apply much more general reconstruction algorithms that do 
not require background subtraction or separating different 
moving components. However, it is a very strong assumption, 
which places some limits in its applicability to real-world 
settings. The approach described in this paper has similar 
performance guarantees but operates on much more general 
video sequences. 

C. Compressed Sensing 

In this section we briefly define the Restricted Isometry 
Property (RIP) and explain its significance to reconstruction 
performance. In subsequent sections, we demonstrate our 
primary theoretical contribution, which is to prove the RIP 
for compressive coded aperture and keyed exposure systems. 

Definition 1 (Restricted Isometry Property (RIP) l32l ). A 

matrix A satisfies the RIP of order s if there exists a constant 
5 S G (0, 1) for which 



(l-S s )\\f\\i<\\Af\\ 2 2 <(l + S s )\\f\\. 



(3) 



holds for all s-sparse f € WL n . If this property holds, we say 
A is RIP{s,S s ). 

Matrices which satisfy the RIP are called CS matrices; 
and when combined with sparse recovery algorithms, they 
are guaranteed to yield accurate estimates of the underlying 
function /*: 

Theorem 1 (Sparse Recovery with RIP Matrices E3), P4l ). 

Let A be a matrix satisfying RIP(2s, 62s) with 62s < v2 — 1, 
and let y = Af + w be a vector of noisy observations of 
any signal f £ R n , where the w is a noise or error term 
with \\w\\2 < e for some e > 0. Let f s be the best s-sparse 
approximation of f; that is, f s is the approximation obtained 
by keeping the s largest entries of f and setting the others to 
zero. Then the estimate 



obeys 



f = argmin ||/||i 

subject to \\y - Af\\ 2 < e, 

||/-/||2<Ci,.e + Ci,. " / " 



(4) 



where Ci jS and C2, s o.re constants which depend on s but not 
on n or m. 

Note that the reconstruction Q in Theorem [T] is equivalent to 

/ = argmin §|| V - +r||/|| a (5) 

where t > 0, which depends on e, can be viewed as a 
regularization parameter. 

III. Compressive Coded Apertures for Snapshot 
Imaging 

This section first considers a snapshot acquisition model. 
Our goal is to recover a static high-resolution scene from a 
single image where all pixels are collected simultaneously. In 



terms of the notation in Sec. |I-A| we have N = M = 1, so that 
A is of size mxn. The sensing matrix for compressive coded 
aperture (CCA) systems can be modeled mathematically as 



Af* = D(f**h); 



(6) 



where h is a coding mask, and D is a subsampling operator 
(detailed below). Here the coding mask, h, is at the size 
and resolution at which /* will be reconstructed; this is in 
contradistinction to the MURA system, in which /j MURA is at 
the size and resolution of the FPA. Thus in |6]), we model the 
measurements as the scene being convolved with the coded 
mask and then downsampled. 

Using a similar model, Romberg ll28l conducted related 
work concurrent and independent of our initial investigations 
|25|. We will summarize the key features of this model 



and compare with our approach in Sec. III-B While these 
models share common elements, we will see in Sec. |III-Cl that 
there are important tradeoffs associated with the theory and 
implementation of each strategy. 

Recent work by Bajwa et al. Il35l . (361, showed that random 
circulant matrices (and Toeplitz matrices, in general) are 
sufficient to recover sparse /* from y with high probability. 
In particular, they showed that a Toeplitz matrix whose first 
row contains elements drawn independently from a Gaussian 
distribution are RIP(s, S s ) when m > Cs 2 \og(n) for some 
constant C. Here, we extend these results to pseudo-circulant 
matrices and use them to motivate our mask design. Our 
model differs from that in |28l in that we consider a different 
generative model for the coded aperture mask, as well as a 
different subsampling strategy: 

• The elements of the coding mask h are generated iid 
according to a particular generating distribution (e.g., an 
appropriately scaled Rademacher, uniform, or Gaussian 
distribution). 

• Our analysis allows for deterministic downsampling in 
which we collect one sample per nonoverlapping block. 

In our notation, we distinguish mask patterns in this manner 
using the abbreviations BS, US, and GS, where the first 
character denotes the distribution used to generate the mask, 
i.e., binary, uniform, or Gaussian, and the second is a reminder 
that these masks are generated directly in the spatial domain. 

A. CCAs Generated in the Spatial Domain 

The two-dimensional convolution of h with an image /* 
as in (|6]l can be represented as the application of the Fourier 
transform to /* and h, followed by element-wise multiplica- 
tion and application of the inverse Fourier transform. In matrix 
notation, this series of linear operations can be expressed as 



vect(/* * h) = T- 1 diag(J7i).77* = Rf*, 



(7) 



where vect(/) is a vectorized representation of an image 
/, F is the two-dimensional Fourier transform matrix, and 
diag(Fh) is a diagonal matrix whose elements correspond 
to the transfer function, which is the Fourier transform of h. 
The matrix product R = F^ 1 di&g(Fh)F G R nxn is block- 
circulant and each block is in turn circulant. In matrix notation, 





Fig. 1. The nxn matrix T 1 diag(J r fe)J r is block-circulant with n,2 blocks 
in each row and column. Each block is n\ X n± and is circulant. 



R is consists of n 2 x n 2 blocks, 



R = 



R\ Rn 2 

R2 Ri 



R3 R2 
Ri R3 



(8) 



_R„ 2 Rn 2 -1 ■ ■ ■ R2 R\_ 
where each Rj E R niXni is circulant; i.e., Rj is of the form 



Rj = 



T 3-\ 



r i,3 r jt2 

r jA r J,3 

r ia 'Yi. 



(see Fig. [TJ. This block-circulant with circulant-block (BCCB) 
structure of R is a direct result of the fact that the fc-point 
one-dimensional Fourier transform Fk diagonalizes any k x k 
circulant matrix (such as Rj with k = n{) and so T = F n2 <g> 
F ni diagonalizes block-circulant matrices (such as R). Here 
® denotes the Kronecker matrix product. 

We now examine the generation of a compressed sensing 
matrix from the convolution R by restricting the number of 
measurements collected of the vector Rf*. Here we simply 
subsample the vector Rf* by applying a pointwise subsam- 
pling matrix Z) su b. The operation of applying D su \, consists of 
retaining only one measurement per uniformly sized d\ x c?2 
block, i.e., we subsample by di in the first coordinate and d-2 
in the second coordinate so that the result is an mi x m 2 image 
with mi — rii/di,i — 1. 2. In matrix form £> su b can be thought 
of as retaining a certain number of rows of the identity matrix. 
Because of the structure and deterministic nature of this type 
of downsampling, it is often more straightforward to realize 



in practical imaging hardware (see Sec. III-C 1 



The resulting projection matrix A is then given by 

A = D suh R = D^F- 1 diag(Fh)F. 

We now show that if the elements of h are drawn from an 
appropriate probability distribution, then A will be RIP(s,<5 s ) 
with high probability. 

Theorem 2 (Spatial-Domain CCA Sensing). Let h BS be a 

mask with entries generated i.i.d. according to the scaled 
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Rademacher distribution 



B. CCAs Generated in the Fourier Domain 



a k l ,k 2 



\J~dJn with probability 1/2, 
— y/d/n with probability 1/2, 



(9) 



for k\ = 1, . . . , m, fca = 1, • • • j n% Let A BS = D su j,R be an 
mxn matrix, where R £ M nxn is the BCCB matrix generated 
from h BS , and D su b € R mxn is the pointwise subsampling 
matrix. Then, there exists constants C\,c 2 > depending on 
5 S such that for any 



Hi > cis 2 log(n), 
A BS is RIP(s,5 s ) with probability exceeding 

l-2n 2 e- C2m / s2 . 



(10) 



(11) 



Remark 1. This binary-valued distribution was selected to 
model coded apertures with two states - open and closed - 
per mask element. We discuss the issue of implementing these 
masks in Section \III-C1\ 

Remark 2. It is straightforward to extend the result to other 
mask generating distributions, such as uniform and Gaussian, 
which we denote h and h , with associated sensing 
matrices A vs and A GS . 

We present the proof of Theorem [2] in Appendix [A] 
For successful recovery, the coded aperture masks are 
designed to be satisfy RIP(2s,(52 S ) as described in Q with 
high probability when m > cis 2 log(n). Note that this is 
somewhat weaker than what can be achieved by a purely 
unstructured i.i.d. randomly generated sensing matrix which 
satisfies ([3]) with high probability when m > Cislog(n/s) 
for some constant 5i > 0. Intuition may lead one to believe 
the extra factor of s is due to the fact that the m projections 
sensed using the amplitude modulated mask framework exhibit 
dependencies. However, in many settings this theoretical dis- 
advantage is offset by advantageous practical implementations. 

Sparse Gradients Scenes: The above theory is applicable to 
reconstructing /* when /* is sparse in the pixel basis; similar 
results hold when the gradient of /* is sparse. For simplicity 
of notation we will show then in a Id setting; the extension to 
2d is straightforward. Let V/ denote the first-order gradient 
of /, so that 



1 
-1 1 
-1 1 











... o 

'•. 

'•. 

-1 1 



(12) 



As noted in |35|, /* = V -1 ^, where 9 is the sparse gradient 
image. Thus if we sense y — AW f* + w = AW~ 1 9 + w = 
A9+w, we can expect to recover 9 using sparse reconstruction 
methods. 



The random convolution in 11281 follows the same structure 
as that in Sec. |IH-A| However, in that work the convolution 
is generated randomly in the frequency domain. More specifi- 
cally, the entries of the transfer function correspond to random 
phase shifts (with some constraints to keep the resulting 
observation matrix real-valued). We denote the resulting con- 
volution kernel as ft, UP , where "UP" refers to "uniform phase". 
For simplicity of presentation, we describe the generating 
distribution for a one-dimensional convolution for even-length 
masks. In particular, let a — Fh VF ', and generate a such that 
for k — 1 , . . . , n, 

±1 with equal probability if k = 1, 

with </>~W(0,2tt) if2<fc<n/2, 

±1 with equal probability if k = n/2 + 1 

X-fc+2 if n/2 + 2 <k<n. 

(13) 

Here U(0, 2tt) denotes the uniform distribution over [0,2tt]. 
The real-valued convolution kernel is then given by /i UP = 
F~ 1 a. This result can be extended easily for two-dimensional 
convolutions. 

To form a compressed sensing matrix from this random con- 
volution, [28 1 considers two different downsampling strategies: 
sampling at random locations, and random demodulation. The 
first method entails selecting a random subset Q C {1, . . . , n} 
of indices. We form a downsampling matrix £>q by retaining 
only the rows of the identity matrix indexed by SI, hence the 
resulting measurement matrix is given by 



,4 UP = DnF- 1 diagO)^, 



(14) 



where a is generated according to the two-dimensional analog 
of ( [13) . The random demodulation method multiplies the result 
of the random convolution by a random sign sequence s G 
{—1, 1}", such that Si = ±1 with equal probability for all 
i = l,...,n, then performs an integration downsampling of 
the result. Therefore in this case the measurement matrix is 



^4 = Antdiag(s)J : "- 1 diag(cr)J : '. 



(15) 



It can be shown that both of these strategies yield RIP- 
satisfying matrices with high probability. 

Theorem 3 (Fourier-Domain CCA Sensing). Let A VP be an 

mxn sensing matrix resulting from random convolution with 
phase shifts followed by a downsampling strategy, and let W 
denote an arbitrary orthonormal basis. If the downsampling 
is random subsampling as in ( |14| >, then there exists constant 
C3 > such that with probability exceeding 1 — 0(n _1 ), for 

Tn > C3<5~ 2 min(s log 6 n, s 2 log 2 n), 

A 111 ^ satisfies RIP(2s,S 2s ) with 6 2s < $■ If the down- 
sampling is random demodulation as in \\5\ , then there 
exists constant c 4 > such that with probability exceeding 
1 - 0(rr l ),for 

m > Ci5~ 2 min(s log 6 n, s 2 log n) , 

A VP W satisfies RIP(2s,6 2s ) with S 2s < S. 
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These theoretical results are stronger than those in The- 
orem [2] especially since they allow sparsity in arbitrary 
orthonormal bases. However, there are important differences 
between the observation models which have a significant 
impact on the feasibility and ease of hardware implementation. 



We elaborate on this in Section III-C Nevertheless, the theory 
developed in [28] lends important theoretical support to the 
general concept of compressive coded apertures. 

C. Hardware and Practical Implementation Considerations 

In this section we describe how we shift from modeling 
the coded aperture masks in a way that is compatible with 
compressed sensing theory, to a model that describes their 
actual implementation in an optical system. Our analysis does 
not account for the bandwidth of the lenses; in particular, 
we implicitly assume that the bandwidth of the lenses is 
high enough that band-limitation effects are negligible at the 
resolution of interest. In all of the hardware settings described 
below, precise alignment of the optical components {e.g., the 
mask and the focal plane array) is critical to the performance 
of the proposed system. Often a high-resolution FPA is helpful 
for alignment and calibration. 

In this paper we focus on incoherent light settings (con- 
sistent with many applications in astronomy, microscopy, and 
infrared imaging). In this case, the coded aperture must be 
real- valued and flux-preserving (i.e., the light intensity hitting 
the detection cannot exceed the light intensity of the source). 
In this section, we consider the following apertures: 

• Binary Spatial Mask: h G {0, l/n}" lX ™ 2 , drawn with 
equal probability, 

• Uniform Spatial Mask: h G [0, l/n]" lX ™ 2 , where each 
element is drawn independently from a uniform distribu- 
tion, or 

• Uniform Phase Mask: h = |.Fp| 2 for some p, where p 
corresponds to a phase-shifting mask in an incoherent light 
setting. 

1 ) Amplitude Modulation Masks: In a conventional lensless 
coded aperture imaging setup, the point spread function asso- 
ciated with the aperture is the mask pattern h itself. To shift a 
RIP-satisfying aperture as in Theorem [2] to an implementable 
aperture, one simply needs to apply an affine transform to 
h mapping [— y^d/n, ^d/n] to [0, 1/n]. This transform en- 
sures that the resulting mask pattern is nonnegative and flux- 
preserving. 

These amplitude modulating masks may be implemented 
using a spatial light modulator (SLM) or placing chrome on 
quartz. The SLMs may be preferable in video settings where 
the underlying scene contains motion and using a different 
mask pattern at each time step boosts performance. However, 
in order for the proposed approach to work, the mask or 
SLM used must be higher resolution than the FPA. Currently, 
very high resolution SLMs are still in development. Chrome 
on quartz masks can be made with higher resolution than 
many SLMs, but cannot be changed on the fly unless we 
mount a small number of fixed masks on a rotating wheel or 
translating stage. The uniform amplitude modulation masks 
in particular could be constructed using a high-resolution 



halftoning procedure, which is easiest to implement at the 
necessary resolution with chrome on quartz. 

Both Robucci et al. 071 and Majidzadeh et al. ||38| have 
proposed performing the analog, random convolution step in 
complementary, metal-oxide-semiconductor (CMOS) electron- 
ics. A clear advantage to this architecture is that the additional 
optics required for spatial light modulation are removed in 
favor of additional circuitry, immediately reducting imager 
size. 

2) Phase Shift Masks: Phase shifting masks for coded 
aperture imaging have been implemented recently using a 
phase screen [39|. This approach allows one to account for 
diffraction in the optical design. However, depending on 
the precise optical architecture, phase shift masks may be 
much less photon-efficient than amplitude modulation masks. 
Additionally, the mask generation distribution described in 
Eq. ( [T3] > will result in negative entries for the corresponding 
PSF 7r^ p = J 7 ~ 1 a. To compensate for this, the phase mask 
must be mean-shifted to make all entries nonnegative, so that 
we are actually implementing 



c{h 



UP 



min(/ l UF )l), 



(16) 



with the constant c selected so that the implementable PSF 
/i^ p is flux-preserving and 1 is a matrix of ones and of the 
same dimension as /i UP . 

3) Implementable Masks for Scenes with Sparse Gradients: 
As described in Section |III-A[ theoretical results for scenes 
with sparse gradients hold for the sparse difference operator 

V defined in ( fT2) ). The problem with this approach is that V 
is invertible but not circulant - and hence the sensing matrix 
AV cannot be implemented with a coded aperture system. 
We address this by noting that if the upper right element of 

V were —1 instead of 0, then the resulting sensing matrix, 
denoted V, could be implemented physically and is a close 
approximation to the theoretically supported V. In short, the 
theoretically supported solution is to set 



y = AVf* + w 
9 = arg min | 



l "y-A0\\l 



while an implementable close approximation is to set A G = 
AV and 



y = A c r + w 



f = argmm|||y 
/ 



i|| y -A G /||2 + r||/|| 



TV. 



where || • ||tv is the total variation seminorm which causes / 
to have sparse gradients. In our numerical experiments we 
compare the performance of TV regularization to sparsity 
penalization for the task of reconstructing static scenes. 

4) Downsampling Implementation: In developing RIP- 
satisfying AMM coded apertures, Theorem [2] assumes the 
subsampling operation selects one measurement per d\ x c?2 
block. From an implementation standpoint, this operation 
effectively discards a large portion {{d — l)/d) of the available 
light, which would result in much lower signal-to-noise ratios 
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at the detector. A more pragmatic approach is to use larger 
detector elements that essentially sum the intensity over each 
d\ x d 2 block, making a better use of the available light. We 
call this operation integration downsampling to distinguish it 
from subsampling. The drawback to this approach is that we 
lose many of the desirable features of the system in terms of 
the RIP. Integration downsampling causes a large coherence 
between neighboring columns of the resulting sensing matrix 
A. An intermediate approach would randomly sum a fraction 
of the elements in each size d block, which increases the 
signal-to-noise ratio versus subsampling, but yields smaller 
expected coherence. This approach is motivated by the random 
demodulation proposed in [28 1 and described in Sec III-B 



whereby the signal is multiplied by a random sequence of signs 
{ — 1,+1}, then block-wise averaged. The pseudo-random 
summation proposed here can be thought of as an optically 
realizable instantiation of the same idea where we multiply 
by a random binary {0, 1} sequence. We explore the effects 
of these choices in our numerical results section. 

5) Noise and Quantization: While CS is particularly useful 
when the FPA needs to be kept compact, it should be noted 
that CS is more sensitive to measurement errors and noise than 
more direct imaging techniques. The experiments conducted 
in this paper simulated very high signal-to-noise ratio (SNR) 
settings and showed that CS methods can help resolve high 
resolution features in images. However, in low SNR settings 
CS reconstructions can exhibit significant artifacts that may 
even cause more distortion than the low-resolution effects 
associated with conventional coded aperture techniques such 
as MURA. 

Similar observations are made in ll40ll . which presents a 
direct comparison of the noise robustness of CS in contrast to 
conventional imaging techniques both in terms of bounds on 
how reconstruction error decays with the number of measure- 
ments and in a simulation setup; the authors conclude that for 
most real-world images, CS yields the biggest gains in high 
signal-to-noise ratio (SNR) settings. Related theoretical work 
in ATI show that in the presence of low SNR photon noise, 
theoretical error bounds can be large, and thus the expected 
performance of CS may be limited unless the number of avail- 
able photons to sense is sufficiently high. These considerations 
play an important role in choosing the type of downsampling 
to implement. 

Similar issues arise when considering the bit-depth of focal 
plane arrays, which corresponds to measurement quantization 
errors. Future efforts in designing optical CS systems must 
carefully consider the amount of noise anticipated in the 
measurements to find the optimal tradeoff between the focal 
plane array size and image quality. 

D. Reconstruction 

To solve the CS minimization problem Q, we use well- 
established gradient-based optimization methods. They are 
particularly suitable in our setting because the block-circulant 
structure of A described in the previous section allows for 
very fast matrix-vector products that are critical to the speed 
of their performance. However, our observation matix A is not 



zero mean and, therefore, negatively impacts the performance 
of these CS-based reconstruction algorithms. In this section, 
we describe how we take full advantage of the structure of A 
and address how we mitigate the negative effects of A having 
nonzero-mean. 

1 ) Sparsity -promoting Methods: We note that most popular 
and effective methods for performing the above reconstruction 
are iterative algorithms with repeated applications of the oper- 
ators A and (A) T to scene estimates and residuals [42 J, [43 1. 
In most CS settings, computing these matrix-vector multiplica- 
tions is a large computational burden. However, because of the 
circulant structure of A, this computation is very efficient using 
fast Fourier transform algorithms. In particular, we compared 
the computation time for reconstructing a 256 x 256 scene 
using 128 x 128 CCA measurements with the time required 
to reconstruct the same scene using the same number of 
measurements computing using a dense random projection 
matrix; our reconstruction was 250 times faster because of 
our exploitation of the Toeplitz structure in A. 

2) Mean Subtraction: Generative models for random pro- 
jection matrices used in CS involve drawing elements inde- 
pendently from a zero-mean probability distribution [1|, [44|, 
l32l . Il35ll . 11451 . and likewise a zero-mean distribution was 
used to analyze the coded aperture masks described in Sec. Ill 
However, as coded aperture masks with zero mean are not 
physically realizable in optical systems, we generate our phys- 
ically realizable masks from an appropriately scaled Bernoulli 
distribution with values {0, 1/n}. This shifting ensures that 
the coded aperture corresponds to an implementable (i.e., 
nonnegative and intensity preserving) mask pattern which does 
satisfies the assumptions needed to model photon propagation 
through an optical system. 

While necessary to accurately model real-world optical sys- 
tems, this shifting negatively impacts the performance of well- 
established I2-I1 reconstruction algorithms for the following 
reason. Several of these algorithms {e.g., [42 1, [43 1) assume 
that the £ 2 data fidelity term 4>(f) = h\\Af - y\\\ (propor- 
tional to the negative log Gaussian likelihood) is such that 
V 2 (f)(x) = A T A can be well-approximated by al, a > 0. This 
assumption is crucial to the performance of these algorithms. 

If we collect measurements using a realizable mask with 
elements in the range [0, 1/n], the resulting matrix A does 
not satisfy this condition due to the mean offset. Therefore in 
the reconstruction we use a mean-shifted sensing matrix 

A® = A — (lj)l m xn, 

where /i^ = (l/mn)l^Al„ is the mean value of A. This 
new matrix is such that (A°) T A° « al is a valid assumption. 
However, to use this matrix in the reconstruction, we need to 
adjust our estimate accordingly. To compensate for this, we 
decompose our estimate / into its mean component /i / and a 
zero-mean deviation f°: f = f° + Hfl-n- So then 

Af = (A + /MlmxnX/ + M/ln) 

= A°f + nfi Af i f l m + (ifA°l n + n A {l T n f)l m 

w A°f +nn A n f l m . 

The data y can be decomposed in a similar fashion, so that 
V = y° + Mylm- Thus a straightforward estimate for the mean 
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Fig. 2. Best performing reconstructions for the different downsampling 
strategies (integration downsampling, random summation, and subsampling) 
and for each downsampling strategy, the different mask distributions (binary 
spatial, uniform spatial, and uniform phase). For each set of images, the 
bottom shows the region indicated by the yellow square in the ground truth 
image. Clearly, significant gains in performance can be achieved by the 
compressive architectures over conventional imaging. 



of our solution is Hf = p, v /p,An, and we can use the zero- 
mean quantities in our data fidelity term 



fO 



Ay 



V 



(17) 



within the iterative algorithm, which now has the desired 
property that V 2 0°(/°) w al. 

We have motivated this mean subtraction approach from an 
estimation setting where we first estimate the mean, then devi- 
ations about the mean. However, it can be equally motivated 
from a pure optimization perspective by viewing the mean 
subtraction step as a preconditioning strategy. The gradient- 
based reconstruction methods considered are known to exhibit 
poor convergence when the Hessian has large condition num- 
ber, and the mean subtraction operation essentially improves 
the conditioning of the problem by eliminating the single 
dominant eigenvalue that occurs due to the nonzero mean. 

E. Experimental Comparison on Static Images 

Here we present numerical experiments supporting the 
proposed architectures for snapshot compressive imaging. We 
compare the simulated performance of the described imaging 
architectures in the task of high-fidelity image reconstruction. 
We utilize a 256 x 256 pixel realistic phantom image of a neu- 
ron (ground truth image in Fig. [2]) which is designed to test the 
resolution limits of our architectures with its high-resolution 
features. This image is inspired by a real microscopic image 
found at http://www.lsi.umich.edu/facultyresearch/labs/bingye/ 
research We examine a traditional imaging system and a total 
of six convolution-based compressive imaging systems. These 
systems are constructed by considering each combination of 
the following design considerations: 

• the mask generation method, including binary spatial 
(BS) and uniform phase (UP) models, 

• switching among integration downsamping, pseudo- 
random summation (randomly summing d/2 values in 
each di x ^2 block), and subsampling. 

To analyze the effect of the number of measurements we 
collect, we vary the scene-to-sensor downsampling ratio d to 
be either 4, 16, or 64 (i.e., downsampling in both directions 
by a factor of 2, 4, or 8). While each architecture has its own 
unique considerations in terms of signal-to-noise efficiency 



(see Section III-C I, we choose to normalize the experiments 
in such a way as to keep a constant signal-to-noise ratio at 
the detector. This is achieved by fixing the variance of the 
additive white Gaussian noise to be a 2 = var(A/)/16 for 
each architecture when we choose a downsampling ratio of 
d = 4. This variance is then fixed for each architecture as 
we vary the downsampling ratio, the rationale being that this 
allows us to showcase the impact of the various subsampling 
operations. 

We consider a comparison of many penalization schemes 
for image reconstruction, such as sparsity in an orthonormal 
wavelet (Haar) basis, isotropic and anisotropic total-variation 
1 46], as well as an l\ sparsity penalty in the overcomplete 
curvelet basis [47 1. Curvelets are similar to wavelets in that 
they capture more spatially localized information than the 
Fourier components, however they are designed to be more 
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well-adapted for capturing curvilinear singularities in images 
(e.g., edges). To reconstruct the image, we use our own 
implementation of the SpaRSA algorithm [42 1 which utilizes 



the mean subtraction procedure described in Section III-D2 A 
coarse initialization is found by solving an unpenalized least- 
squares minimization via a conjugate gradient method using 
only the compressive data. We terminate the reconstruction 
when the relative change in the iterates falls below a pre- 
specified tolerance of le-3. We choose the regularization 
weighting to minimize the final reconstruction RMSE, mea- 
sured by RMSE(/) = ||/ - /*|| 2 /||/*||2- A table of the 
resulting RMSE values is presented in Table [I] 

As evidenced by both the RMSE values and the images, 
significant gains in performance can be achieved by the 
compressive architectures. The UP mask architectures slightly 
outperform the BS-based approaches in this simulation, but 
the type of subsampling has a greater effect on the quality 
of the reconstruction. Focusing on the d = 4 case, we see 
that subsampling performs best overall, followed by random 
summation, then integration downsampling. This is readily 
apparent from the RMSE values and the greater clarity by 
which we can resolve fine-scale features in the dendrites of 
the neuron. This is exactly predicted by the theory: larger 
coherence yields poorer performance. However, pushing the 
level of undersampling, we see that simple subsampling is 
not always the solution, since the performance degrades quite 
rapidly, as the per-measurement signal to noise ratio does 
not naturally increase as it does with the other subsampling 
architectures. An optimal architecture must strike a careful 
balance between low coherence and robustness to noise. 

IV. Coded Aperture Keyed Exposure Sensing for 
Dynamic Scenes 

Here we detail our compressive video acquisition method 
that combines coded apertures and keyed exposures to address 



all the competing challenges detailed in Sec. I-A Specifically, 
in our CAKE imaging method, each observed frame y k is 
given by an exposure of B high-rate coded observations: 



Uk 



- E A(k-l)B+tf*k-l)B+t 



Wk, 



(18) 



t=i 



where each A t describes an AMM CCA sensing matrix. Note 
that since in our theory, the downsampling operator D su b is a 
structured nonrandom operator, we can rewrite the above as 



Vk = A 



ub 



e R{k-\)B+tf*k-i)B+t 



(19) 



Hence all that is required to implement this sensing paradigm 
is modulating the coded aperture mask over the _B-frame 
exposure time. Because of this, one can think of our system 
as performing a coded exposure acquisition for each coded 
aperture element. 

Since our sensing strategy is independent across each low- 
resolution frame, and also for simplicity of presentation, we 
only consider the recovery of a length B block of frames from 
a single snapshot image in our theoretical analysis. 



Theorem 4 (Coded Aperture Keyed Exposure Sensing). Let 

A = [A\ ■ ■ ■ Ab] be an m x nB sensing matrix for the CAKE 
system where the coded aperture pattern for each A t is drawn 
i.i.d. from a suitable probability distribution. Then there exists 
constants C\ , Ci > depending on 5 S such that for any 

m>c 1 s 2 log(nB), (20) 

A is RIP(s, 5 S ) with probability exceeding 



1 - 2n 2 B 2 e- C2m/s2 



(21) 



Note here that s denotes the sparsity of the first B frames 
of the video sequence, rather than simply the sparsity of 
an individual frame as in Thm. [2] The proof of Thm. [4] 
is presented in Appendix [B] Similar to Thm. [2] this proof 
assumes that the entries of each h t , t — 1, are generated 

i.i.d. according to the scaled Rademacher distribution 
Again, it is straightforward to extend the result to other mask 
generating distributions. 

A. Sparse Transformation Sensing 

It is most often the case that the original frames /* are 
not sparse, but can be sparsely represented by a temporal 
transform of the frames. For simplicity, consider the first coded 
measurement. Within this acquisition, the coefficient sequence 



YWk, t ft, k = i,. 



t=i 



B. 



may be more sparse for a well-chosen sparse temporal trans- 
form W. Notationally this is equivalent to 6* = {W ® I n )f* , 
where <E> denotes the Kronecker matrix product. A preferred 
sensing strategy would be to use our RIP-satisfying CAKE 
acquisition to sense the coefficient sequence 9*: 



B 



A k 6t 



(22) 



fc=l 

Surprisingly, this can be accomplished using an identical 
architecture, with some slight adjustment to the coded aperture 
mask patterns used during the keyed acquisition. 

To see this, we examine the resulting sensing matrix in terms 
of /*: 



B 

E 

fe=i 



B 

A k 9* k = J2 A k^W k , t f* = 



k=l 
B 



t—1 



B 

E 

*=i 



B 

E 

Lfc=l 



W k , t A k 



ft 



ft, 



where we define 



A-7 — E W k ,tA k . 

k=l 



(23) 



Therefore ( |22| > is also a CAKE acquisition using the sensing 
matrices A^ . Because of the linear dependence on the gen- 
erating mask patterns, this amounts to using the an identical 
architecture with adjusted mask patterns 



hf = Y,w k , t h k . 

k=l 



(24) 
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TABLE I 

Summary of the RMSE values for the different mask generation and downsampling schemes, including the conjugate gradient 

INITIALIZATION, AND THE VARIOUS RECONSTRUCTION METHODS CONSIDERED. THE IMAGES ASSOCIATED WITH THE BEST PERFORMING METHODS 

INDICATED BY THE BOLD RMSE VALUES ARE SHOWN IN FlG.[2] 



If we denote h w = (h^ , . . . , h^), then this can be written 
more simply as h w = {W T ®I n )h = (W®I n ) T h. Hence the 
CAKE system can adapt to any temporal sparse coding of the 
video sequence over the block of coded frames. In summary, 
to incorporate the sparse transform W applied to the frames, 
we simply apply the adjoint transform to the independently 
generated mask patterns. 

A useful transformation that allows the exploitation of 
dependencies between frames is to assume that the difference 
between subsequent frames are sparse. In this case, we select 
W — V, as defined in ( |12) , In particular, we sense using the 
coded sequence of aperture patterns /i v where 



hi 



h k -h k+1 fc=l, .. 
hs k = B, 



,B-1, 



and the generating aperture patterns h are drawn from a 
suitable distribution. 

It is of interest to note that since the CAKE sensing matrix 
is a concatenation of downsampled Toeplitz matricies, this 
sensing strategy has clear connections to ll48l where they 
consider concatenations of Toeplitz matricies as a sensing 
matrix for performing multiuser detection in wireless sensor 
networks. The important conceptual link is that their sensing 
matrix is used to determine a sparse set of simultaneously 



active users, where in our system, we are using it to infer a 
sequence of sparse frames, or a sequence of sparse difference 
frames. 



B. Implementation and Normalization Details 

Recall that these RIP-satisfying sensing matrices are gen- 
erated using a zero-mean probability distribution and cannot 
be directly implemented in an optical system. As before, we 
apply an affine transformation to each of the A k so that the 
resulting mask elements for each h k are within the range 
[0,1/ti]. If we are generating a sensing matrix that satisfies 
RIP with respect to the difference frames, then even a binary 
generating distribution will cause the resulting mask patterns 
to have elements that are neither fully open nor fully closed. 
This can be accomplished using half-toning as discussed in 



Sec. III-C If such an implementation is difficult, then one 
strategy would be to set these intermediate values to or 1/n 
at random with equal probability. 



C. Reconstruction for Video 

Given measurements from the proposed CAKE imaging 
system, where we have designed the mask patterns for sparsity 
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Fig. 3. Example frame (frame 21) from the ground truth video sequence 
used in the numerical experiments. 



in a temporal transform, we recover the frames by solving 

9 = argmin \\\A6 - y\\ + t||0||i 
6 

f= (W- l ®I n )9. 

For the case of sparse difference frames (W = V), improve- 
ments in empirical performance are made by penalizing the 
total variation of the first frame, instead of simply the sparsity 
in that frame: 



6 = arg min ~ \ \ AO 
e 

/ = {L®I n )d, 



Try f 1 ||tV 



k=2 



here L — V -1 is a lower triangular matrix of all ones. 

In addition, by using more than one low-rate observation 
per reconstruction step, we are typically able to improve 
performance by coding the difference frames across more than 
one length-S block of high-rate frames. This is the reconstruc- 
tion technique we use in the experimental results section. In 
previous work, we noticed that when the amount of processing 
time allotted per frame is held constant, the accuracy generally 
increases with the number of frames processed simultaneously. 
However, one simply cannot solve for arbitrarily many frames 
simultaneously, as the improvement in accuracy diminishes 
when the size of the problem is such that only a very small 
number of reconstruction iterations can be run within the 
allotted time. We refer the reader to [26] for details. 

For video sequences of longer duration, we may wish to 
process the video in an online fashion, solving for only a few 
blocks of frames simultaneously. In many applications, such 
as surveillance and monitoring, the video frames are strongly 
correlated, and the solution to a previous block of frames may 
be used as an initialization to the algorithm to solve the next 
block of frames. This estimate will generally be very accurate, 
and therefore, relatively few iterations are needed to obtain a 
solution to each optimization problem. 

D. Numerical Experiments 

In this section, we demonstrate the effectiveness of the pro- 
posed CAKE architecture at successfully recovering a video 
sequence of short-wave infrared (SWIR) data collected (cour- 
tesy of Jon Nichols at NRL) by a short-wave IR (0.9 — 1.7/im) 



camera. The camera is based on a 1024 x 1280 InGaAs 
(Indium, gallium arsenide) focal plane array with 20/im pixel 
pitch. Optically, the camera was built around a fixed, f/2 
aperture and provides a 6° field of view along the diagonal 
with a focus range of 50m —> oo. Imagery were output at the 
standard 30Hz frame rate with a 14 bit dynamic range. An 
example frame is shown in Fig. [3] In this sequence, the three 
boats are traveling at different velocities with respect to the 
slowly-changing background of the waves. The size of each 
frame is 128 x 256, and we consider reconstructing 28 frames 
of the sequence. 

We consider CAKE observations where we downsample 
spatially by a factor of 2 in both directions (di = g?2 = 2), 
and use a coded exposure block length of B = 4. We compare 
acquiring the sequence with independent mask codes, and 
mask codes designed to exploit the sparse difference frames 
directly. We reconstruct the entire video sequence using all 7 
low-resolution low-rate frames using a total variation penalty 
on the first frame, and l\ sparsity penalty on all subsequent 
difference frames. We optimize the regularization parameters 
to minimize the reconstruction error. For comparison, we 
consider traditionally captured data (i.e., by simply averaging 
over di x d 2 x B blocks of the spatiotemporal video sequence). 
To interpolate this data to the original resolution of the 
video sequence, we consider both nearest-neighbor and spline 
interpolation. 

We show the estimates for the different acquisition and 
reconstruction methods in Fig. [4] Here we focus only on a 
ROI of the sailboat. We see that the CAKE sensing is able 
to reconstruct the scene with a higher spatial and temporal 
resolution. This is evidenced in the residuals, which include 
much less critical scene structure than the conventional system. 
We see from the examination of the difference frame that the 
nearest-neighbor reconstruction from conventionally sampled 
data yields no motion over the two frames and suffers from 
poor spatial resolution. Using spline interpolation helps im- 
prove the spatial resolution, but it is still insufficient to recover 
the scene with high temporal resolution, as can be seen in the 
blur on the leading edge of the sail. Numerically we quantify 
the performance over the entire video sequence in terms of the 
RMSE (%), 100 • ||/ - /*|| 2 /||/*||2, calculated both over the 
entire frame, and only over the sailboat ROI. This is tabulated 
in Table [TT] In summary we see that the CAKE acquisitions 
are able to outperform traditionally sampled video in terms of 
reconstruction accuracy and reconstructing salient motion. 

It should be noted that there are limitations to the CAKE 
system in the presence of strong motion. In this case, the 
sparsity level of the difference frames may drastically increase 
as the previous frame ceases to be a good prediction of the 
next frame. As such, the RIP bound in Theorem [4] states the 
number of measurements we require to reconstruct the scene 
must necessarily increase, requiring either a faster temporal 
resolution measurements or higher resolution FPA to achieve 
the same accuracy. Because of this balance, a system designer 
may need to make important engineering tradeoffs to imple- 
ment CAKE acquisition for a particular application. 
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Frame 2 1 Frame 22 



Difference Residuals Residuals 
Frame Frame 2 1 Frame 22 




Truth 



Conventional 
Nearest-neighbor 
Upsampling 

Conventional 
Spline 
Upsampling 

CAKE 
Difference-frame 
Mask Codes 

CAKE 
Independent 
Mask Codes 



Fig. 4. Results obtained for the sailboat ROI for an example pair of frames 
(frames 21 and 22). Shown are the true frames, the reconstruction from 
traditionally sampled data, and the reconstruction from CAKE sensed data. For 
comparison we show the residuals as compared with the truth, as well as the 
difference between the frames. Note that using nearest neighbor interpolation 
results in no motion over the block of frame, hence a zero difference frame. 





Reconstruction RMSE (%) 


Sensing Architecture 


Full 


Frame 


Sailboat ROI 


Conventional (Nearest-neighbor) 


5.5163 


(5.5305) 


10.3525 (10.4241) 


Conventional (Spline) 


3.7654 


(3.8047) 


5.9335 ( 6.0377) 


CAKE (Difference-frame Codes) 


3.5840 


(3.9183) 


5.6971 ( 7.0421) 


CAKE (Independent Codes) 


2.9079 


(3.0932) 


4.3266 ( 4.9604) 



TABLE II 

Reconstruction RMSE achieved for the conventional and 

CAKE ARCHITECTURES OVER THE VIDEO SEQUENCE. RESULTS ARE 
REPORTED BOTH FOR THE FULL FRAME AND THE ROI OF THE SAILBOAT. 
DUE TO BOUNDARY ISSUES, WE REPORT THE RMSE DISCOUNTING THE 

FIRST AND LAST BLOCK OF B FRAMES. THE RMSE VALUES FOR THE 
ENTIRE SEQUENCE ARE GIVEN IN PARENTHESES. 



V. Conclusions 

Compressed sensing offers a strong theoretical foundation 
for sparse signal recovery. However, practical and imple- 
mentable imaging system designs based on CS theory have 
lagged far behind. In this paper, we demonstrate how CS 
principles can be applied to physically realizable optical 
system designs, namely coded aperture imaging. Numerical 
experiments show that CS methods can help resolve high 
resolution features in images and videos that conventional 
imaging systems cannot. We have also demonstrated that our 
CAKE acquisition system can recover video sequences from 
highly under-sampled data in much broader settings than those 
considered in initial coded exposure studies. However, our de- 
rived theoretical limits show that there are important tradeoffs 
involved that depend on the spatial sparsity, and temporal 
similarity of the scene that ultimately govern the accuracy 
of our reconstructions for a specified number of compressive 
measurements. Finally, we note that our proposed approach 
performs well in high signal-to-noise ratio settings, but like 



all CS reconstructions, it can exhibit significant artifacts in 
photon-limited settings. 

Two practical issues associated with coded aperture imaging 
in general are the blur due to misalignment of the mask and 
diffraction and interference effects. Noncompressive aperture 
codes have been developed to be robust to these effects [49|. 
One important avenue for future research is the development 
of compressive coded aperture makes with similar robustness 
properties. Noncompressive coded apertures have also been 
shown useful in inferring the depth of different objects in a 
scene; similar inference may be possible with the compressive 
coded apertures described in this paper. 
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Appendix 

A. Proof of the RIP for Spatial-Domain CCA Sensing 

Here we present the proof that appropriately scaled com- 
pressive coded apertures satisfy the RIP. 

Proof of Theorem |2f In the interest of notational sim- 
plicity, it is easier to work on the two-dimensional images 
versus their one-dimensional vectorial representations. For 
concreteness, we assume the entries of h are generated i.i.d. 
according to the scaled Rademacher distribution 




with probability 1/2, 
with probability 1/2, 



for ki = 1, . . . , n\ and k-i — 1, . . . , n^. The proof uses the 
same techniques as that of Theorem 4 in ll36l . where the RIP 
is established by shifting the analysis of the submatrices of 
A to the entries of the Gram matrix G = A T A by invoking 
Gersgorin's disc theorem |f50l . This theorem states that the 
eigenvalues of aniixn complex matrix G all lie in the union 



of n discs dj(cj,rj) 
with radius 



j = 1, 2, . . . , n, centered at Cj 



Gn 



In essence, we show that with high probability G sa /, so that 
the eigenvalues of G are clustered around one with suitably 
high probability. 

From the discussion above, we have that 

Al,k = ^mod[(ii-l)di-fci+l,ni] + l,mod[(; 2 -l)(22-fc2+l,ri2] + l! 

so the entries of the resulting Gram matrix G are 



G,, 



m/di n 2 /d,2 

E E 



2l = I '2 



(25) 



lmod[(Zi-l)dl-pi + l,ni] + l,mod[(2 2 -l)d 2 -p 2 + l,n2] + l' 



"mod[(2 1 -l)d 1 -<; 1 +l,n 1 ] + l,mod[(Z2-l)ci 2 -<?2 + l,n2] + l- 

From the normalization introduced for h, we can show that 
E[G] = /. Now we need to bound the deviation about the 
mean via concentration. We consider first the diagonal entries 
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of G, when p = q. Each term is a sum of n = n\n 2 bounded 
i.i.d. entires, and applying Hoeffding's inequality yields 

P(|G M -1| ><5 d )<2expp^L 

Next we consider the off-diagonal entries in the special case 
that either mod(pi — gi, d\) 7^ or mod(j>2 — Q2,d 2 ) = 0. In 
this case, each of the terms in the summand in ( |2~5j ) picks out 
a different set of coefficients from h, and hence there are no 
dependencies between different terms in the sum, hence it is 
also a sum of n = n\n-z bounded i.i.d. entries, and thus from 
Hoeffding's inequality, we have 



P(|G P ,,| >S /s) <2exp 



-n5l 
2ds 2 



Lastly, we have the case that both mod(pi — q\,dx) = and 
mod(p2 — <Z2, d 2 ) = with p 7^ q. This case deserves special 
care since each of the terms in the summand in <j25j picks out 
the same set of coefficients from h. Therefore there necessarily 
exist dependencies within the terms of the sum. However, due 
to the special nature by which we select the coefficients, we 
can partition the sum G Pi9 into two sums, denoted Si and S 2 
such that each of these is a sum of n/2d independent terms. 
We can then apply the Hoeffding bound to each of these to 
yield 

P(|#| >5 Q /2s) <2exp( ^-f 
Then applying the union bound gives us 



1,2. 



P(|G,, g |>5 /.s)<4ex Pl |;j 



^n6 2 
Ads 



Since this latter bound decays more slowly, and in the interest 
of simplicity, we overbound using this latter expression. What 
remains is to now apply the union bound over all the diagonal 
and off-diagonal elements. For this we need to count the 
number of elements in the union. For the diagonal entries, 
there are clearly n elements. And since the entries of G are 
symmetric, we only have n(n— l)/2 remaining terms. Hence 
taking 5 d = 5 D = 5 s /2, 



¥(A does not satisfy RIP(s,(5 5 )) 
( -n5 2 \ 

< 2nexp f — ^ J + 4[n(n - l)/2] exp 

-n5 2 



-not 
16ds 2 



< [2n + 2n(n - 1)] exp 



IQds 2 



2n exp 



16ds 2 



< 2n exp 



-c 2 n 
~ds Y ~ 



where c 2 < S 2 /16. If n > 2, this probability is less than one 
provided n/d > cis 2 log(n) where c\ > 3/c2, noting that 
m — n/d establishes the theorem. ■ 

B. Proof of RIP for CAKE 

Proof of Theorem |?J This section details the proof of 
Theorem [4] Our strategy here is the same as that of a single 
frame, we bound the entries of the Gram matrix. First note 



that in this case the Gram matrix has a certain block structure; 
since A — [A\ ■ ■ ■ As], we have 



G = A A 



AJA 1 A\A 2 
AU 1 



AJAb 
A\A B 



A T B A % 



A b Ab 



This block structure allows us to utilize the results established 
in Appendix [A] For the diagonal elements of G, we simply 
use the same bound for one particular block AjAk, since this 
is the situation for a sensing a single frame. Therefore 



P 



\G g , g -l\ 



> Sd) < 2 exp 



-2nd 2 



For the off-diagonal entries of G, we need to consider the 
off-diagonal entries of AjAk and any entry of A^Ai, k 7^ I. 
Recall that in the Appendix [A] we simply used the worst- 
case slower-decaying bound when the off-diagonal entry of the 
Gram matrix was no longer a sum of all independent terms. 
Here we use the same overbound and hence have 

f-n5 2 " 

P(|G M |>a /*)<2exp^_| 

Lastly, we need to perform a union bound over all the entries 
of the matrix. We have nB diagonal entries, and exploiting 
symmetry, we only have nB(nB — l)/2 remaining entries. 
Hence taking Sd = S = S s /2, 

P(A does not satisfy RIP(s, 5 S )) 
f-n8 2 \ 

< 2nBexp ( ) + A[nB{nB - l)/2] exp 



2<7 



< [2nB + 2nB(nB - 1)] exp 



-nS 2 s 
16ds 2 



-nS 2 
16ds< 



= 2n 2 B 2 exp 



( 



V I6ds 2 



< 2n 2 B 2 exp 



-c 2 n 
~d^~ 



where c 2 < 8 2 /16. If nB > 2, this probability is less than one 
provided n/d > c\s 2 \og(nB) where c\ > 3/c 2 , again noting 
that m = n/d establishes the theorem. ■ 

References 

[I] E. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact 
signal reconstruction from highly incomplete frequency information," 
IEEE Trans. Inform. Theory, vol. 52, no. 2, pp. 489 - 509, 2006. 

[2] D. L. Donoho, "Compressed sensing," IEEE Trans. Inform. Theory, vol. 
52, no. 4, pp. 1289-1306, 2006. 

[3] M. Shankar, R. Willett, N. Pitsianis, T. Schulz, R. Gibbons, R. T. Kolste, 
J. Carriere, C. Chen, D. Prather, and D. Brady, "Thin infrared imaging 
systems through multichannel sampling," Applied Optics, vol. 47, no. 
10, pp. B1-B10, 2008. 

[4] A. F. Coskun, I. Sencan, T.-W. Su, and A. Ozcan, "Lensless wide-field 
flourescent imaging on a chip using compressive decoding of sparse 
objects," Optics Express, vol. 18, no. 10, pp. 10510-10523, 2010. 

[5] L. C. Potter, E. Ertin, J. T. Parker, and M. Cetin, "Sparsity and 
compressed sensing in radar imaging," Proceedings of the IEEE, vol. 
98, no. 6, pp. 1006-1020, 2010. 

[6] J. Gu, S. K. Nayar, E. Grinspun, P. N. Belhumeur, and R. Ramamoorthi, 
"Compressive structured light for recovering inhomogeneous participat- 
ing media," in Proceedings of the European Conference on Computer 
Vision, 2008. 

[7] M. Mohtashemi, H. Smith, D. Walburger, F. Sutton, and J. Diggans, 
"Sparse sensing dna microarray-based biosensor: Is it feasible?" in 
2010 IEEE Sensors Applications Symposium, Feb 2010, pp. 127-130. 



14 



[8] M.A. Sheikh, 0. Milenkovic, and R.G. Baraniuk, "Designing compres- 
sive sensing dna microarrays," in 2nd IEEE International Workshop on 
Computational Advances in Multi-Sensor Adaptive Processing, 2007, 
December 2007, pp. 141-144. 

[9] M. Lustig, D. Donoho, and J. M. Pauly, "Sparse MRI: The application 
of compressed sensing for rapid MR imaging," Magnetic Resonance in 
Medicine, vol. 58, no. 6, pp. 1182-1195, December 2007. 
[10] A.C. Gurbuz, J.H. McClellan, and W.R. Scott, "A compressive sensing 
data acquisition and imaging method for stepped frequency GPRs," 
IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2640 - 
2650, July 2009. 

[11] P. Ye, J.L. Paredes, G.R. Arce, Y. Wu, C. Chen, and D.W. Prather, 
"Compressive confocal microscopy," in IEEE International Conference 
on Acoustics, Speech and Signal Processing, 2009, April 2009, pp. 429 
-432. 

[12] J. Bobin, J.-L. Starck, and R. Ottensamer, "Compressed sensing in 
astronomy," Selected Topics in Signal Processing, IEEE Journal of, vol. 
2, no. 5, pp. 718 -726, October 2008. 

[13] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. 
Kelly, and R. G. Baraniuk, "Single pixel imaging via compressive 
sampling," IEEE Sig. Proc. Mag., vol. 25, no. 2, pp. 83-91, 2008. 

[14] J. G. Abies, "Fourier transform photography: a new method for X-ray 
astronomy," Proceedings of the Astronomical Society of Australia, vol. 
1, pp. 172, 1968. 

[15] R. H. Dicke, "Scatter-hole cameras for X-rays and gamma-rays," 
Astrophysical Journal, vol. 153, pp. L101-L106, 1968. 

[16] E. Caroli, J. B. Stephen, G. Di Cocco, L. Natalucci, and A. Spizzichino, 
"Coded aperture imaging in X- and gamma-ray astronomy," Space 
Science Reviews, vol. 45, pp. 349403, 1987. 

[17] G. Skinner, "Imaging with coded-aperture masks," Nucl. Instrum. 
Methods Phys. Res., vol. 221, pp. 3340, Mar. 1984. 

[18] R. Accorsi, F. Gasparini, and R. C. Lanza, "A coded aperture for high- 
resolution nuclear medicine planarimaging with a conventional anger 
camera: experimental results," IEEE Trans. Nucl. Sci., vol. 28, pp. 241 1- 
2417, 2001. 

[19] G. R. Gindi, J. Arendt, H. H. Barrett, M. Y. Chiu, A. Ervin, C. L. 

Giles, M. A. Kujoory, E. L. Miller, and R. G. Simpson, "Imaging with 

rotating slit apertures and rotating collimators," Medical Physics, vol. 

9, pp. 324-339, 1982. 
[20] S. R. Meikle, R. R. Fulton, S. Eberl, M. Bahlbom, K.P. Wong, and M. J. 

Fulham, "An investigation of coded aperture imaging for small animal 

SPECT," IEEE Trans. Nucl. Sci., vol. 28, pp. 816-821, 2001. 
[21] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, "Image and depth 

from a conventional camera with a coded aperture," in Proc. of Intl. 

Conf. Comp. Graphics, and Inter. Tech., 2007. 
[22] A. Agrawal and Yi Xu, "Coded exposure deblurring: Optimized codes 

for psf estimation and invertibility," in IEEE Computer Vision and 

Pattern Recognition, 2009. 
[23] R. Raskar, A. Agrawal, and J. Tumblin, "Coded exposure photography: 

Motion deblurring using fluttered shutter," in ACM Special Interest 

Group on Computer Graphics and Interactive Techniques (SIGGRAPH), 

2006. 

[24] A. Veeraraghavan, D. Reddy, and R. Raskar, "Coded strobing pho- 
tography: Compressive sensing of high speed periodic videos," IEEE 

Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 

4, pp. 671-686, 2011. 
[25] R. F. Marcia and R. M. Willett, "Compressive coded aperture superres- 

olution image reconstruction," in Proc. IEEE Int. Conf. Acoust., Speech, 

Signal Processing (ICASSP), 2008. 
[26] R. F. Marcia and R. M. Willett, "Compressive coded aperture video 

reconstruction," in Proc. 16th Euro. Signal Process. Conf., 2008, 

Lausanne, Switzerland, August 2008. 
[27] R. F. Marcia, Z. T. Harmany, and R. M. Willett, "Compressive coded 

apertures for high-resolution imaging," in Proc. 2010 SPIE Photonics 

Europe, Brussels, Belgium, April 2010. 
[28] J. Romberg, "Compressive sampling by random convolution," SIAM 

Journal on Imaging Sciences, vol. 2, no. 4, pp. 1098-1128, 2009. 
[29] N. J. A. Sloane and M. Harwit, "Masks for Hadamard transform optics, 

and weighing designs," Appl. Opt., vol. 15, no. 1, pp. 107-114, 1976. 
[30] A. Ashok and M. A. Neifeld, "Pseudorandom phase masks for 

superresolution imaging from subpixel shifting," Appl. Opt., vol. 46, 

no. 12, pp. 2256-2268, 2007. 
[31] S. R. Gottesman and E. E. Fenimore, "New family of binary arrays for 

coded aperture imaging," Appl. Opt., vol. 28, no. 20, pp. 43444352, 

1989. 

[32] E. J. Candes and T. Tao, "Decoding by linear programming," IEEE 
Trans. Inform. Theory, vol. 15, no. 12, pp. 4203 4215, 2005. 



[33] E. Candes, "Compressive sampling," in Int. Congress of Mathematics, 

2006, vol. 3, pp. 1433-1452. 
[34] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from 

incomplete and inaccurate measurements," Communications on Pure 

and Applied Mathematics, vol. 59, no. 8, pp. 1207-1223, 2006. 
[35] W. Bajwa, J. Haupt, G. Raz, S. Wright, and R. Nowak, "Toeplitz- 

structured compressed sensing matrices," in Proc. of Stat. Sig. Proc. 

Workshop, 2007. 

[36] J. D. Haupt, W. U. Bajwa, G. M. Raz, and R. D. Nowak, "Toeplitz 
compressed sensing matrices with applications to sparse channel esti- 
mation," IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 
5862-5875, 2010. 

[37] R. Robucci, J. D. Gray, L. K. Chiu, J. Romberg, and P. Hasler, 
"Compressive sensing on a CMOS separable-transform image sensor," 
Proceedings of the IEEE, vol. 98, no. 6, pp. 1089-1101, 2010. 

[38] V. Majidzadeh, L. Jacques, A. Schmid, P. Vandergheynst, and 
Y. Leblebici, "A (256*256) pixel 76.7mW CMOS imager/compressor 
based on real-time in-pixel compressive sensing," in IEEE International 
Symposium on Circuits and Systems, 2010, pp. 2956-2959. 

[39] W. Chi and N. George, "Phase-coded aperture for optical imaging," 
Optics Communications, vol. 282, pp. 2110-2117, 2009. 

[40] J. Haupt and R. Nowak, "Compressive sampling vs. conventional 
imaging," in Proc. IEEE Intl. Conf. on Imaging Processing, 2006, pp. 
1269-1272. 

[41] M. Raginsky, R. M. Willett, Z. T. Harmany, and R. F. Marcia, "Com- 
pressed sensing performance bounds under Poisson noise," IEEE Trans. 
Signal Process., vol. 58, no. 8, pp. 39904002, 2010. 

[42] S. Wright, R. Nowak, and M. Figueiredo, "Sparse reconstruction by 
separable approximation," IEEE Trans. Signal Process., vol. 57, pp. 
2479-2493, 2009. 

[43] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, "Gradient 
projection for sparse reconstruction: Application to compressed sensing 
and other inverse problems," IEEE J. Set Top. Sign. Proces.: Special 
Issue on Convex Optimization Methods for Signal Processing, vol. 1, 
no. 4, pp. 586-597, 2007. 

[44] J. Haupt and R. Nowak, "Signal reconstruction from noisy random 
projections," IEEE Trans. Inform. Theory, vol. 52, no. 9, pp. 4036- 
4048, 2006. 

[45] R. G. Baraniuk, M. Davenport, R. A. DeVore, and M. B. Wakin, "A 
simple proof of the restricted isometry property for random matrices," 
Constructive Approximation, vol. 28, no. 3, pp. 253-263, 2008. 

[46] L.I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based 
noise removal algorithms," Physica D: Nonlinear Phenomena, vol. 60, 
no. 1-4, pp. 259-268, 1992. 

[47] E. J. Candes and D. L. Donoho, "New tight frames of curvelets and 
optimal representations of objects with piecewise C 2 singularities," 
Communications on Pure and Applied Mathematics, vol. 57, no. 2, pp. 
219-266, 2004. 

[48] L. Applebaum, W.U. Bajwa, M.F Duarte, and R. Calderbank, "Asyn- 
chronous code-division random access using convex optimization," 
Accepted to Elsevier Phy. Commun., 2011. 

[49] J. W. Stayman, N. Subotic, and W. Buller, "An analysis of coded aperture 
acquisition and reconstruction using multi-frame code sequences for 
relaxed optical design constraints," in Proceedings SPIE 7468, 2009. 

[50] R. Varga, Gersgorin and His Circles, Springer- Verlag, Berlin, Germany, 
2004. 



PLACE 
PHOTO 
HERE 



night vision. 



Zachary T. Harmany Zachary Harmany received 
the B.S. (magna cum lade, with honors) in Elec- 
trical Engineering and B.S. (cum lade) in Physics 
from The Pennsylvania State University in 2006. 
Currently, he is a Ph.D. student in the department 
of Electrical and Computer Engineering at Duke 
University. In 2010 he was a visiting researcher at 
The University of California, Merced. His research 
interests include nonlinear optimization, functional 
neuroimaging, and signal and image processing with 
applications in medical imaging, astronomy, and 



15 



Roummel F. Marcm Roummel Marcia received 
his B.A. in Mathematics from Columbia University 
in 1995 and his Ph.D. in Mathematics from the 
University of California, San Diego in 2002. He 
was a Computational and Informatics in Biology 
and Medicine Postdoctoral Fellow in the Biochem- 
istry Department at the University of Wisconsin- 
Madison and a Research Scientist in the Electrical 
and Computer Engineering at Duke University. He 
is currently an Assistant Professor of Applied Math- 
ematics at the University of California, Merced. His 
research interests include nonlinear optimization, numerical linear algebra, 
signal and image processing, and computational biology. 



PLACE 
PHOTO 
HERE 



Rebecca M. Willett (SM '02, M '05, SM '11) is 
an assistant professor in the Electrical and Com- 
puter Engineering Department at Duke University. 
She completed her PhD in Electrical and Computer 
Engineering at Rice University in 2005. Prof. Willett 
received the National Science Foundation CAREER 
Award in 2007, is a member of the DARPA Com- 
puter Science Study Group, and received an Air 
Force Office of Scientific Research Young Investi- 
gator Program award in 2010. Prof. Willett has also 
held visiting researcher positions at the Institute for 
Pure and Applied Mathematics at UCLA in 2004, the University of Wisconsin- 
Madison 2003-2005, the French National Institute for Research in Computer 
Science and Control (INRIA) in 2003, and the Applied Science Research and 
Development Laboratory at GE Healthcare in 2002. Her research interests 
include network and imaging science with applications in medical imaging, 
wireless sensor networks, astronomy, and social networks. 



PLACE 
PHOTO 
HERE 



